The attention mechanism is designed to focus on different parts of the input data, depending on the context. In the Transformer model, the attention mechanism allows the model to focus on different words in the input sequence when producing an output sequence. The strength of the attention is determined by a score computed from a query and a key; that score then decides how much weight the corresponding value receives.
Given a Query (Q), a Key (K), and a Value (V), the attention mechanism computes a weighted sum of the values, where the weight assigned to each value is determined by the query and the corresponding key.
The attention score for a query Q and a key K is calculated as:
\text{score}(Q, K) = Q \cdot K^T
This score is then passed through a softmax function to get the attention weights:
\text{softmax}\left(\frac{\text{score}(Q, K)}{\sqrt{d_k}}\right)
Here d_k is the dimension of the key vectors; dividing the scores by \sqrt{d_k} helps stabilize the gradients.
Finally, the output is calculated as a weighted sum of the values:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
In practice, we don’t calculate attention for a single word, but rather for a set of words (i.e., a sequence). To do this efficiently, we use matrix operations.
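To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention computed for a whole sequence at once; the function names, toy dimensions, and random weight matrices are illustrative assumptions, not part of the original text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a whole sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len_q, seq_len_k) attention scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # weighted sum of the value vectors

# Toy example: a sequence of 4 "words" with model dimension 8 (illustrative sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (4, 8)
```

Each row of the result is the attention-weighted mix of value vectors for one position in the sequence.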
In multi-head attention, the idea is to have multiple sets of Query, Key, Value weight matrices. Each of these sets will generate different attention scores and outputs. By doing this, the model can focus on different subspaces of the data.
Let's denote the number of heads as h.
Each head i has its own weight matrices: W_i^Q, W_i^K, and W_i^V.
For each head i (from 1 to h), compute the Query, Key, and Value matrices just as in single-head attention:
Q_i = X W_i^Q
K_i = X W_i^K
V_i = X W_i^V
Using the Q_i, K_i, and V_i matrices, we calculate the output for each head:
\text{Score}_i = Q_i K_i^T
\text{Scaled Score}_i = \frac{\text{Score}_i}{\sqrt{d_k}}
\text{Attention Weights}_i = \text{softmax}(\text{Scaled Score}_i)
\text{Output}_i = \text{Attention Weights}_i \, V_i
Now, after obtaining the output for each head, we need to combine these outputs to get a single unified output.
Concatenation & Linear Transformation: The outputs from all heads are concatenated and then linearly transformed to produce the final output:
This multi-head mechanism allows the Transformer to focus on different positions with different subspace representations, making it more expressive and capable of capturing various types of relationships in the data.